Basics of Graphics

Monday, April 10

Today we will…

Tidy Data

Artwork by Allison Horst

Same Data, Different Formats

Different formats of the data are tidy in different ways.

Team   Points   Assists   Rebounds
A      88       12        22
B      91       17        28
C      99       24        30
D      94       28        31

Team   Statistic   Value
A      Points      88
A      Assists     12
A      Rebounds    22
B      Points      91
B      Assists     17
B      Rebounds    28
C      Points      99
C      Assists     24
C      Rebounds    30
D      Points      94
D      Assists     28
D      Rebounds    31
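As a rough sketch of how the two formats relate, the tidyr package (an assumption here; it is not named on the slide) can convert one into the other. The object names `wide`, `long`, and `wide_again` are made up for illustration.

```r
library(tidyr)

# Wide format: one row per team, one column per statistic.
wide <- data.frame(
  Team     = c("A", "B", "C", "D"),
  Points   = c(88, 91, 99, 94),
  Assists  = c(12, 17, 24, 28),
  Rebounds = c(22, 28, 30, 31)
)

# Long format: one row per team-statistic pair.
long <- pivot_longer(
  wide,
  cols      = c(Points, Assists, Rebounds),
  names_to  = "Statistic",
  values_to = "Value"
)

# Going back: spread the Statistic column into separate columns again.
wide_again <- pivot_wider(long, names_from = Statistic, values_from = Value)
```

Which format counts as "tidy" depends on what you treat as an observation: a team, or a single statistic for a team.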

Working with External Data

Common Types of Data Files

Look at the file extension for the type of data file.

.csv : “comma-separated values”

Name, Age
Bob, 49
Joe, 40

.xls, .xlsx: Microsoft Excel spreadsheet

  • Common approach: save as .csv
  • Nicer approach: use the readxl package

.txt: plain text

  • Could have any sort of delimiter…
  • Need to let R know what to look for!

Loading External Data

Using base R functions:

  • read.csv() is for reading in .csv files.

  • read.table() and read.delim() are for any data with “columns” (you specify the separator).
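A minimal sketch of the base R readers, using inline text in place of a file on disk so it runs anywhere. The pipe-delimited data is a hypothetical example, not a course file.

```r
# Inline text stands in for a .csv file.
csv_text <- "Name, Age
Bob, 49
Joe, 40"

# strip.white = TRUE drops the space that follows each comma.
people <- read.csv(text = csv_text, strip.white = TRUE)

# The same records with a pipe delimiter; read.table() must be told
# the separator and that the first line holds column names.
pipe_text <- "Name|Age
Bob|49
Joe|40"
people_pipe <- read.table(text = pipe_text, sep = "|", header = TRUE)
```

With a real file, replace `text = ...` with the file path as the first argument.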

Loading External Data

The tidyverse has some cleaned-up versions in the readr and readxl packages:

  • read_csv() is for comma-separated data.

  • read_tsv() is for tab-separated data.

  • read_table() is for white-space-separated data.

  • read_delim() is for any data with “columns” (you specify the separator). The functions above are special cases.

  • read_excel() is specifically for dealing with Excel files.

Remember to load the readr and readxl packages first!
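A small sketch of the readr and readxl readers. It writes two temporary files so the example is self-contained; the file contents and the semicolon delimiter are assumptions for illustration. The Excel workbook is one of the example files that ships with readxl.

```r
library(readr)
library(readxl)

# Create two small temporary files to read back in.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

txt_path <- tempfile(fileext = ".txt")
writeLines(c("Name;Age", "Bob;49", "Joe;40"), txt_path)

people_csv   <- read_csv(csv_path)                # comma-separated
people_delim <- read_delim(txt_path, delim = ";") # you name the separator

# readxl_example() returns the path to a workbook bundled with readxl.
xlsx_path   <- readxl_example("datasets.xlsx")
first_sheet <- read_excel(xlsx_path)              # first sheet by default
```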

Grammar of Graphics

The Grammar of Graphics (GoG) is a principled way of specifying exactly how to create a particular graph from a given data set. It helps us to systematically design new graphs.


Think of a graph or a data visualization as a mapping…

FROM variables in the data set (or statistics computed from the data)…

TO visual attributes (or “aesthetics”) of marks (or “geometric elements”) on the page/screen.

Why Grammar of Graphics?

  • It’s more flexible than a “chart zoo” of named graphs.
  • The software understands the structure of your graph.
  • It easily automates graphing of data subsets.

ggplot2: elegant graphics for data analysis by Hadley Wickham

The grammar makes it easier for you to iteratively update a plot, changing a single feature at a time. The grammar is also useful because it suggests the high-level aspects of a plot that can be changed, giving you a framework to think about graphics, and hopefully shortening the distance from mind to paper. It also encourages the use of graphics customised to a particular problem, rather than relying on specific chart types.

Components of Grammar of Graphics

  • data: dataframe containing variables
  • aes : aesthetic mappings (position, color, symbol, …)
  • geom : geometric element (point, line, bar, box, …)
  • stat : statistical variable transformation (identity, count, linear model, quantile, …)
  • scale : scale transformation (log scale, color mapping, axes tick breaks, …)
  • coord : Cartesian, polar, map projection, …
  • facet : divide into subplots using a categorical variable
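To make the components concrete, here is one possible plot with each piece labeled in a comment. The choice of variables from the mpg data set is only an illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg,                          # data
            mapping = aes(x = displ,             # aes: x position
                          y = hwy,               # aes: y position
                          color = drv)) +        # aes: color
  geom_point() +                                 # geom (identity stat)
  geom_smooth(method = "lm",                     # geom with a linear-model stat
              formula = y ~ x,
              se = FALSE) +
  scale_x_log10() +                              # scale: log-transformed x-axis
  coord_cartesian() +                            # coord: Cartesian (the default)
  facet_wrap(~ year)                             # facet: one panel per year

p
```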

Using ggplot2

How to Build a Graphic

Complete this template to build a basic graphic:


  • We use + to add layers to a graphic.

This begins a plot that you can add layers to:

ggplot(data = mpg)

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       )

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter()

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter() +
  geom_boxplot()

How would you make the points be on top of the boxplots?
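One way to approach this: layers are drawn in the order they are added, so a sketch that adds the boxplots first and the points second puts the points on top. Hiding the boxplot outliers with outlier.shape = NA is an optional extra, added here so that outlying points are not drawn twice.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            aes(x = class, y = hwy)
            ) +
  geom_boxplot(outlier.shape = NA) +  # drawn first, so it sits underneath
  geom_jitter()                       # drawn second, so points sit on top

p
```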

Aesthetics

We map variables (columns) from the data to aesthetics on the graphic using the aes() function.

What aesthetics can we set (see ggplot2 cheat sheet for more)?

  • x, y
  • color, fill
  • linetype
  • lineend
  • size
  • shape

Special Properties of Aesthetics

Global Aesthetics

ggplot(data = housingsub, 
       mapping = aes(x = date, 
                     y = median)
       ) +
  geom_point()

Local Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median)
             )

Mapping Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median,
                           color = city)
             )

Setting Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median), 
             color = "blue"
               )
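The difference between mapping and setting is easy to get wrong, so here is a short sketch using the built-in mpg data (a stand-in for housingsub, which is not bundled with ggplot2). Putting a constant inside aes() is a common slip: ggplot2 treats "blue" as a one-level variable rather than as a color.

```r
library(ggplot2)

# Setting: the color is a constant, given outside aes().
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy),
             color = "blue")

# Common slip: inside aes(), "blue" is data, not a color.
# The points get a default palette color and a legend appears.
p_slip <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, color = "blue"))

# layer_data() shows the color actually used for each point.
unique(layer_data(p_set)$colour)
unique(layer_data(p_slip)$colour)
```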

Geometric Objects

We use a geom_xxx() function to represent data points.

one variable

  • geom_density()
  • geom_dotplot()
  • geom_histogram()
  • geom_boxplot()

two variables

  • geom_point()
  • geom_line()
  • geom_density_2d()

three variables

  • geom_contour()
  • geom_raster()

Not an exhaustive list – see ggplot2 cheat sheet.
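As a quick illustration, the same data and aesthetic mapping can be drawn with different one-variable geoms just by swapping the geom layer. The variable and binwidth below are arbitrary choices.

```r
library(ggplot2)

# Shared starting point: data plus one mapped variable.
base <- ggplot(data = mpg, aes(x = hwy))

p_hist <- base + geom_histogram(binwidth = 2)  # counts in bins of width 2
p_dens <- base + geom_density()                # smoothed density estimate

p_hist
p_dens
```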

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_point() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_text(aes(label = class)) +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_line() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Creating a Graphic

To create a specific type of graphic, we will combine aesthetics and geometric objects.


Let’s try it!

Start with the TX housing data.

Make a plot of median house price over time (including both individual data points and a smoothed trend line), distinguishing between different cities.

ggplot(data = txhousing, aes(x = date, y = median, color = city)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Statistical Transformation: stat

A stat transforms an existing variable into a new variable to plot.

  • identity leaves the data as is.
  • count counts the number of observations.
  • summary allows you to specify a desired transformation function.

Sometimes these statistical transformations happen under the hood when we call a geom.

Statistical Transformation: stat

ggplot(data = mpg,
       mapping = aes(x = class)) +
  geom_bar()

ggplot(data = mpg,
       mapping = aes(x = class)) +
  stat_count(geom = "bar")

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "mean") +
  scale_y_continuous(limits = c(0,45))

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "max") +
  scale_y_continuous(limits = c(0,45))
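To see what a stat computes under the hood, one option is ggplot2's layer_data() function, which returns the data frame a layer actually draws. The sketch below checks that geom_bar() plots the same counts that table() reports.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            mapping = aes(x = class)) +
  geom_bar()

# The count column is created by the stat; it is not a column in mpg.
computed <- layer_data(p)
computed[, c("x", "count")]

# The same counts, computed by hand.
table(mpg$class)
```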

Faceting

Extracts subsets of data and places them in side-by-side graphics.

ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) + 
  geom_point() +
  facet_grid(.~class)

  • facet_grid(. ~ b): facet into columns based on b
  • facet_grid(a ~ .): facet into rows based on a
  • facet_grid(a ~ b): facet into both rows and columns
  • facet_wrap( ~ b): wrap facets into a rectangular layout
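For a variable with many levels, facet_wrap() is often easier to read than a single long row of panels. A small sketch (the ncol value is an arbitrary choice):

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, ncol = 4)  # one panel per class, wrapped into 4 columns

p
```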

You can set scales to let axis limits vary across facets:

facet_grid(y ~ x, scales = ______)

  • "free" – both x- and y-axis limits adjust to individual facets
  • "free_x" – only x-axis limits adjust
  • "free_y" – only y-axis limits adjust
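A sketch comparing fixed and free y-axes, using drv from the mpg data as an example faceting variable. With facets arranged in rows, "free_y" lets each row choose its own y-axis limits.

```r
library(ggplot2)

base <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Default: every facet shares the same axis limits.
p_fixed <- base + facet_grid(drv ~ .)

# Each row of facets gets y-axis limits that fit its own data.
p_free <- base + facet_grid(drv ~ ., scales = "free_y")

p_fixed
p_free
```

Shared scales make comparisons across facets easier; free scales make patterns within each facet easier to see.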

You can set a labeller to adjust facet labels:

  • facet_grid(. ~ fl, labeller = label_both)
  • facet_grid(. ~ fl, labeller = label_bquote(alpha ^ .(x)))
  • facet_grid(. ~ fl, labeller = label_parsed)

Position Adjustments

Position adjustments determine how to arrange geoms that would otherwise occupy the same space.

  • position = 'dodge': Arrange elements side by side.
  • position = 'fill': Stack elements on top of one another + normalize height.
  • position = 'stack': Stack elements on top of one another.
  • position = 'jitter': Add random noise to the x and y position of each element to avoid overplotting (see geom_jitter()).

Position Adjustments

ggplot(mpg, aes(fl, fill = drv)) + 
  geom_bar(position = "")
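As a sketch of how the position argument changes a bar chart, the three variants below share the same data and mapping and differ only in position.

```r
library(ggplot2)

base <- ggplot(mpg, aes(fl, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # bars stacked (the default)
p_dodge <- base + geom_bar(position = "dodge")  # bars side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, heights scaled to 1

p_stack
p_dodge
p_fill
```

With "fill", the y-axis shows proportions within each fuel type rather than counts.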

Plot Customizations

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  theme_bw() +
  theme(legend.position = "bottom")

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x     = "Engine Displacement (liters)",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_y_continuous("Highway MPG", 
                     limits = c(0,50),
                     breaks = seq(0,50,5)
                     )

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x    = "Engine Displacement (liters)",
       y    = "Highway MPG",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_color_gradient(low = "white", high = "green4")

Formatting your Plot Code

It is good practice to put each geom and aes on a new line.

  • This makes code easier to read!
  • Generally: no line of code should be over 80 characters long.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point() + theme_bw() + labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, 
                     y = hwy, 
                     color = class)
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", 
       y = "Highway (mpg)"
       )

ggplot(data = mpg, 
       mapping = aes(x = cty, y = hwy, color = class)
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", y = "Highway (mpg)")

Let’s Practice!

How would you make this plot from the diamonds dataset in ggplot2?


  • data
  • aes
  • geom
  • facet

Creating a Game Plan

There are a lot of pieces to put together when creating a good graphic.

  • So, when sitting down to create a plot, you should first create a game plan!

This game plan should include:

  1. What data are you starting from?
  2. What are your x- and y-axes?
  3. What type(s) of geom do you need?
  4. What other aesthetics do you need?

Use the mpg dataset to create two side-by-side scatterplots of city MPG vs. highway MPG where the points are colored by the drive type (drv). The two plots should be separated by year.

ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy,
                     color = drv)
       ) +
  geom_point() +
  facet_grid(.~year) +
  labs(x = "city MPG",
       y = "highway MPG") +
  scale_color_discrete(name = "drive type",
                      labels = c("4-wheel","front","rear"))

PA 2: Using Data Visualization to Find the Penguins

Artwork by Allison Horst

To do…

  • PA 2: Using Data Visualization to Find the Penguins
    • Due Wednesday (4/12) at 8:00am
  • Bonus Challenge: FizzBuzz (+5)
    • Due Saturday (4/15) at 11:59pm (optional)
  • Bonus Challenge: Ugly Graphics of Penguins (+5)
    • Due Monday (4/17) at 10:00am (optional)

Wednesday, April 12

Today we will…

  • Review PA 2: Using Data Visualization to Find the Penguins
  • Ugly Graphics of Penguins
  • New Material
    • What makes a good graphic?
  • Lab 2: Exploring Rodents with ggplot2
  • Challenge 2: Spicing things up with ggplot2

Why are some plots easier to read than others?

What makes bad figures bad?

Edward R. Tufte is one of the better-known critics of this style of visualization:

  • Graphical excellence is the well-designed presentation of interesting data. It:
    • communicates complex ideas with clarity, precision, and efficiency
    • maximizes the “data-ink” ratio
    • is nearly always multivariate
    • requires telling the truth about the data
  • Tufte defines “chartjunk” as superfluous detail.

bad data.

Looking at pictures of data means looking at lines, shapes, and colors

Our visual system works in a way that makes some things easier for us to see than others

  • “Preattentive” features
  • Gestalt Principles
  • color and contrast

Good Graphics

Graphics consist of:

  • Structure: boxplot, scatterplot, etc.

  • Aesthetics: features such as color, shape, and size that map other characteristics to structural features

Both the structure and aesthetics should help viewers interpret the information.

Gestalt Principles

What sorts of relationships are inferred, and under what circumstances?

  • Proximity: Things that are spatially near to one another are related.
  • Similarity: Things that look alike are related.
  • Enclosure: Things that are surrounded by a common visual element are related.
  • Symmetry: If an object is asymmetrical, the viewer will waste time trying to find the problem instead of concentrating on the instruction.
  • Closure: Incomplete shapes are perceived as complete.
  • Continuity: Partially hidden objects are completed into familiar shapes.
  • Connection: Things that are visually tied to one another are related.
  • Figure/Ground: Visual elements are either in the foreground or the background.

Gestalt Principles

Gestalt Hierarchy    Graphs
Enclosure            Facets
Connection           Lines
Proximity            White Space
Similarity           Color/Shape

Implications for practice

  • Know how we perceive groups
  • Know that we perceive some groups before others
  • Design to facilitate and emphasize the most important comparisons

Pre-attentive Features

Pre-Attentive Features are things that “jump out” in less than 250 ms

  • Color, form, movement, spatial localization

There is a hierarchy of features

  • Color is stronger than shape
  • Combinations of pre-attentive features are usually not pre-attentive due to interference

Pre-attentive Features: Double Encoding

Color

  • Hue: shade of color (red, orange, yellow…)

  • Intensity: amount of color

  • Both hue and intensity are pre-attentive. Bigger contrast corresponds to faster detection.

  • Use color to your advantage

  • When choosing color schemes, we will want mappings from data to color that are not just numerically but also perceptually uniform

  • Distinguish between sequential scales and categorical scales

Color: Implications and Guidelines

  • Do not use rainbow color gradient schemes.
  • Avoid any scheme that uses green-yellow-red signaling if you have a target audience that may include colorblind people.
  • To “colorblind-proof” a graphic, you can use a couple of strategies:
    • double encoding - where you use color, use another aesthetic (line type, shape)
    • If you can print your chart out in black and white and still read it, it will be safe for colorblind users. This is the only foolproof way to do it!
    • If you are using a color gradient, use a monochromatic color scheme where possible.
    • If you have a bidirectional scale (e.g. showing positive and negative values), the safest scheme to use is purple - white - orange. In any color scale that is multi-hue, it is important to transition through white, instead of from one color to another directly.
  • Be conscious of what certain colors “mean”
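A minimal sketch of double encoding in ggplot2: the same variable (drv, chosen only as an example) is mapped to both color and shape, so the groups stay distinguishable in grayscale. Giving both aesthetics the same legend title lets ggplot2 merge them into one legend.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            mapping = aes(x = cty,
                          y = hwy,
                          color = drv,   # first encoding of drive type
                          shape = drv)   # second encoding of drive type
            ) +
  geom_point(size = 2) +
  labs(color = "Drive type",
       shape = "Drive type")

p
```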

Gradients

No more than 7 colors

Can use colorRampPalette() (from the grDevices package, which ships with base R) to produce larger palettes by interpolating existing ones, such as those from the RColorBrewer package

Use color gradient with only one hue for positive values

Use color gradient with two hues for positive and negative values. Gradient should go through a light, neutral color (white)
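The sketch below illustrates both ideas under some assumptions: it interpolates the 9-color "Blues" palette from RColorBrewer up to 15 colors, then builds a two-hue gradient that passes through white at zero. The `changes` data frame is made up for illustration.

```r
library(ggplot2)
library(RColorBrewer)

# Interpolate a 9-color, single-hue palette up to 15 colors.
blues_9  <- brewer.pal(9, "Blues")
blues_15 <- colorRampPalette(blues_9)(15)

# Hypothetical data with both negative and positive values.
changes <- data.frame(
  group  = c("a", "b", "c", "d", "e"),
  change = c(-2, -1, 0, 1, 2)
)

# Two hues, meeting at white where the value is zero.
p <- ggplot(changes, aes(x = group, y = 1, fill = change)) +
  geom_tile() +
  scale_fill_gradient2(low = "purple", mid = "white", high = "orange",
                       midpoint = 0)

p
```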

Color in ggplot2

There are packages available for use that have color scheme options.

Some Examples:

  • RColorBrewer
  • ggsci
  • viridis
  • wesanderson

There are packages such as RColorBrewer and dichromat that have color palettes which are aesthetically pleasing, and, in many cases, colorblind friendly.

You can also take a look at other ways to find nice color palettes.
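As a starting point, ggplot2 itself includes scale functions for the ColorBrewer and viridis palettes. The sketch below applies each to the same plot; the "Dark2" palette is just one example choice.

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) +
  geom_point()

# Qualitative ColorBrewer palette (Dark2 has 8 colors; class has 7 levels).
p_brewer <- p + scale_color_brewer(palette = "Dark2")

# Discrete viridis palette, designed to be perceptually uniform.
p_viridis <- p + scale_color_viridis_d()

p_brewer
p_viridis
```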

Week 2 Assignments

Lab 2: Exploring Rodents with ggplot2

Challenge 2: Spicing things up with ggplot2

To do…

  • Lab 2: Exploring Rodents with ggplot2
    • due Friday, 1/20 at 11:59pm
  • Challenge 2: Spicing things up with ggplot2
    • due Saturday, 1/21 at 11:59pm
  • Read Chapter 3: Data Cleaning and Manipulation
    • Concept Check 3.1 + 3.2 due Monday (1/23) at 8am